Functions are an extremely important way to write concise code whenever we need to repeat the same “set of operations” multiple times.
Example:
fahrenheit_to_celsius <- function(temp_F){
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}temp <- fahrenheit_to_celsius(32)
temp [1] 0We can compose the result of different functions
celsius_to_kelvin <- function(temp_C){
  temp_K <- temp_C + 273.15
  return(temp_K)
}The conversion from Fahreneit to Kelvin can be obtained via
celsius_to_kelvin(fahrenheit_to_celsius(100))[1] 310.9278You can set default inputs as
mySum <- function(num1, num2 = 10){
  out <- num1 + num2
  return(out)
}
mySum(2)[1] 12mySum(2, 5)[1] 7You can return multiple outputs as
my_operations <- function(num1, num2) {
  out <- num1 + num2
  out1 <- num1 * num2
  return(list('sum' = out, 'prod' = out1))
}
res <- my_operations(2, 5)
res$sum[1] 7res$prod[1] 10Remember the scoping rules: variables defined within a function are not visible outside of it. Conversely, the body of a function can see the variables that are outside of it.
Let us generate some data from a linear model.
set.seed(1234)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100) # rnorm(100) is the additive noise
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)Simple linear fit: we use the lm() function
fit <- lm(y ~ x) # y = \beta_0 + \beta_1 * x
summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.88626 -0.61401  0.00236  0.58645  2.98774 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.0372     0.1050    9.88   <2e-16 ***
x             1.9739     0.1038   19.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.037 on 98 degrees of freedom
Multiple R-squared:  0.7869,    Adjusted R-squared:  0.7847 
F-statistic: 361.8 on 1 and 98 DF,  p-value: < 2.2e-16Remember how to interpret this table:
There are (at least) two ways to plot the regression line:
predict()abline() and input the regression coefficientsMethod 1:
xgrid <- seq(min(x), max(x), length.out = 100)
beta_hat <- coef(fit)
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
lines(xgrid, as.numeric(predict(fit, newdata = list(x = xgrid))), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)Method 2:
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
abline(coef(fit), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)The advantage of method 1 is that it generalizes to any statistical model.
Let’s take a look at the values of \(\hat{\beta}_0, \hat{\beta}_1\). We could have just computed them by hand, right? Remember that \[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}, \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}.\]
my_beta_hat <- c(mean(y) - cor(x, y) * sd(y)/sd(x) * mean(x), cor(x, y) * sd(y)/sd(x))
print(cbind(beta_hat, my_beta_hat))            beta_hat my_beta_hat
(Intercept) 1.037154    1.037154
x           1.973915    1.973915What happens if we replicate the data generating mechanism?
B <- 1000
betas <- array(NA, dim = c(B, 2))
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
for (b in 1:B){
  x <- rnorm(100)
  y <- 1 + 2 * x + rnorm(100)
  fit <- lm(y ~ x)
  betas[b,] <- coef(fit)
  abline(fit, col = alpha(cols[1], 0.05))
}More complicated models:
y ~ x         # y = beta_0 + beta_1 * x
y ~ 0 + x     # y = beta_1 * X
y ~ x1 + x2   # y = beta_0 + beta_1 * x1 + beta_2 * x2
y ~ x1 * x2   # y = beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x1 * x2
y ~ I(x / 2)  # y = beta_0 + beta_1 * (x / 2)Note that I evaluates argument as a regular R expression. Also, you should know that the lm function is not doing anything fancy. It is simply solving efficiently the normal equations and reporting a bunch of test statistics. In principle, you could code your own LinearModel function. You just have to remember that
\[\widehat{\boldsymbol{\beta}} = (X^{\intercal} X)^{-1} X^{\intercal} \mathbf{y},\]
which is the multivariate equivalent of the formulas that we have seen in class.
The goal is to use the dataset Boston to predict median house value using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status). We first start with a univariate regression in which the only covariate is lstat. As usual, let us start by loading the libraries that we will use.
rm(list=ls())
setwd("~/Desktop/SDS323 - Fa20/Laboratories/Lab 2 - Regression/")
library(ggplot2)
library(latex2exp)
library(RColorBrewer)
library(ISLR)
library(MASS)
library(car)
library(plotrix)
theme_set(theme_bw(base_size = 14))
cols <- brewer.pal(9, "Set1")names(Boston) [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   ?Boston # get information on the data set
fit1 <- lm(medv ~ lstat, data = Boston) 
summary(fit1) # summary of the linear model fit is more useful
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16coef(fit1)(Intercept)       lstat 
 34.5538409  -0.9500494 confint(fit1) # CI for the intercept and slope                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505par(mar=c(4,4,2,2))
plotCI(x = 1:2, y = coef(fit1), ui = confint(fit1)[,2], li = confint(fit1)[,1], 
       lwd = 2, xaxt = 'n', xlab = '', ylab = 'value', pch = 16)
axis(1, at = 1:2, labels = c(TeX("$\\beta_0$"), TeX("$\\beta_1$")))
abline(h = 0, lwd = 2, lty = 2, col = cols[1])We can make prediction on new x values using CI or PI. Remember the difference?
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "confidence")       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461predict(fit1, data.frame(lstat = c(5,10,15)), interval = "prediction")       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846Let us show the CI for the regression line, i.e. for the quantity \(\mathbb{E}[Y \mid X]\)
xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv') # scatter plot of medv and lstat
abline(fit1, col = cols[1], lwd = 3) # add the regression line in the scatter plot
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "confidence") 
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])), 
        col = alpha('gray', 0.5), border = alpha('gray', 0.5))It is widely different from the PI for a new observation, i.e. for the quantity \(Y^\star\).
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
abline(fit1, col = cols[1], lwd = 3)
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "prediction") 
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])), 
        col = alpha('gray', 0.5), border = alpha('gray', 0.5))We have already noted that some data points seem to be problematic.
par(mfrow = c(2,2))
plot(fit1) # make residual plot par(mfrow = c(1,1))
plot(hatvalues(fit1), pch = 16) # hat value is the measure for leverage
abline(h = length(coef(fit1))/nrow(Boston), lwd = 2, lty = 2, col = cols[1])Let us see how the fitted values compare to the observed ones
y_hat <- as.numeric(predict(fit1, data.frame(lstat = Boston$lstat)))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])The MSE (mean squared error) is
mean((y_hat - Boston$medv)^2)[1] 38.48297How would you compute the RSS?
Let us now use multiple predictors! Let’s first take a look at the data.
pairs(Boston)This does not help much, but hopefully the regression model can shed a light on some of these relationships. Before starting, we can take a look at the correlations in the data.
library(ggcorrplot)
ggcorrplot(round(cor(Boston), 2), hc.order = TRUE, type = "lower",
   lab = TRUE)In the multivariate case, there are two equivalent ways of writing a linear model with lm().
data = Boston) and use the regular syntax \(Y \sim X_1 + \dots + X_p\)model.matrix()fit2 <- lm(medv ~ lstat + age, data = Boston) # same lm command for simple or multiple linear regression
summary(fit2)
Call:
lm(formula = medv ~ lstat + age, data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16X <- model.matrix(medv ~ lstat + age, data = Boston)
head(X)  (Intercept) lstat  age
1           1  4.98 65.2
2           1  9.14 78.9
3           1  4.03 61.1
4           1  2.94 45.8
5           1  5.33 54.2
6           1  5.21 58.7Y <- Boston$medv
fit2 <- lm.fit(X, Y)
coef(fit2)(Intercept)       lstat         age 
33.22276053 -1.03206856  0.03454434 fit3 <- lm(medv ~ ., Boston) # . means all variables in the data set except the response
summary(fit3)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16par(mfrow = c(2,2))
plot(fit3)fit4 <- lm(medv ~ .- age, Boston) # the "-" means "do not include a particular variable"
summary(fit4)
Call:
lm(formula = medv ~ . - age, data = Boston)
Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16Let us see how the fitted values compare to the observed ones
y_hat <- as.numeric(predict(fit3, newdata = Boston))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])The MSE (mean squared error) is
mean((y_hat - Boston$medv)^2)[1] 21.89483Something we have not considered here but can be tried:
fit5 <- lm(medv ~ lstat * age, Boston) # * is for interaction term
summary(fit5)
Call:
lm(formula = medv ~ lstat * age, data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16fit6 <- lm(medv ~ lstat + I(lstat^2), Boston) # need to have I() to protect the square term
summary(fit6)
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16fit7 <- lm(medv ~ poly(lstat, 4, raw = T), Boston)
summary(fit7)
Call:
lm(formula = medv ~ poly(lstat, 4, raw = T), data = Boston)
Residuals:
    Min      1Q  Median      3Q     Max 
-13.563  -3.180  -0.632   2.283  27.181 
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               5.731e+01  2.280e+00  25.134  < 2e-16 ***
poly(lstat, 4, raw = T)1 -7.028e+00  7.308e-01  -9.618  < 2e-16 ***
poly(lstat, 4, raw = T)2  4.955e-01  7.489e-02   6.616 9.50e-11 ***
poly(lstat, 4, raw = T)3 -1.631e-02  2.994e-03  -5.448 7.98e-08 ***
poly(lstat, 4, raw = T)4  1.949e-04  4.043e-05   4.820 1.90e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.28 on 501 degrees of freedom
Multiple R-squared:  0.673, Adjusted R-squared:  0.6704 
F-statistic: 257.8 on 4 and 501 DF,  p-value: < 2.2e-16xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
y_hat6 <- as.matrix(model.matrix(~ xgrid + I(xgrid^2))) %*% coef(fit6)
y_hat7 <- as.matrix(model.matrix(~ poly(xgrid, 4, raw = T))) %*% coef(fit7)
par(mfrow = c(1,1))
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
lines(xgrid, y_hat6, col = cols[1], lwd = 3)
lines(xgrid, y_hat7, col = cols[2], lwd = 3)
legend('topright', col = cols[1:2], legend = c('Quadratic','Fourth degree'), lwd = 3)In greenmoving.txt there is a list of travel times (in minutes) as a function of the length of the route [in m] and of the means of transportation used (walking, cycling, driving).
It is a categorical variable with three levels: we need two dummy variables.
green <- read.table('./Data/greenmoving.txt', header=T)
# Let us transform the transport variable in a factor!
green$transport <- factor(green$transport)
walk_idx <- as.numeric(green$transport == 'walk')
bike_idx <- as.numeric(green$transport == 'bike')
car_idx <- as.numeric(green$transport == 'car')
ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')The reference is going to be bike (first in alphabetical order in the factor)!
\[time = \beta_0 + \beta_1*car + \beta_2*walk + \beta_3*length + \beta_4*length*car + \beta_5*length*walk + \varepsilon\]
This is a model with interactions! Therefore we have a slope and an intercept for every level of transport.
Bike regression line: \(\beta_0 + \beta_3*length + \varepsilon\)
Car regression line: \((\beta_0 + \beta_1) + (\beta_3 + \beta_4)*length + \varepsilon\)
Walk regression line: \((\beta_0 + \beta_2) + (\beta_3 + \beta_5)*length + \varepsilon\)
fit <- lm(time ~ car_idx + walk_idx + length + length:car_idx + length:walk_idx, data = green)
summary(fit)
Call:
lm(formula = time ~ car_idx + walk_idx + length + length:car_idx + 
    length:walk_idx, data = green)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.5667208  1.3487377   3.386 0.000915 ***
car_idx          1.2107320  1.6665020   0.727 0.468705    
walk_idx        -1.1597079  1.9965712  -0.581 0.562250    
length           0.0052521  0.0013307   3.947 0.000123 ***
car_idx:length  -0.0002884  0.0013484  -0.214 0.830913    
walk_idx:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  geom_abline(intercept = sum(coef(fit)[1]), slope = sum(coef(fit)[4]), col = cols[1]) + 
    geom_abline(intercept = sum(coef(fit)[1:2]), slope = sum(coef(fit)[4:5]), col = cols[2]) + 
  geom_abline(intercept = sum(coef(fit)[c(1,3)]), slope = sum(coef(fit)[c(4,6)]), col = cols[3]) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')There is a much better way of doing this. Instead of creating the dummy variables by hand, we let R take care of it. We only have to declare that the variable transport is a factor!
fit <- lm(time ~ transport + length + length:transport, data = green)
summary(fit)
Call:
lm(formula = time ~ transport + length + length:transport, data = green)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.5667208  1.3487377   3.386 0.000915 ***
transportcar          1.2107320  1.6665020   0.727 0.468705    
transportwalk        -1.1597079  1.9965712  -0.581 0.562250    
length                0.0052521  0.0013307   3.947 0.000123 ***
transportcar:length  -0.0002884  0.0013484  -0.214 0.830913    
transportwalk:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16Some dummy variables do not seem to be significant. Let us verify it in a more precise way.
The means of transportation is significant if AT LEAST one of the coefficients related to the dummy variables is significantly different from 0, i.e. the null hypothesis is \[\beta_{1} = \beta_{2} = \beta_{4} = \beta_{5} = 0.\]
rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1))     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    0    0    0    0
[2,]    0    0    1    0    0    0
[3,]    0    0    0    0    1    0
[4,]    0    0    0    0    0    1linearHypothesis(fit,
                 rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0,0))Linear hypothesis test
Hypothesis:
transportcar = 0
transportwalk = 0
transportcar:length = 0
transportwalk:length = 0
Model 1: restricted model
Model 2: time ~ transport + length + length:transport
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    148 2606.3                                  
2    144 1415.0  4    1191.3 30.309 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Thus, it appears that the transportation does influence the response.
The length of the route is significant if AT LEAST one of the coefficients where it appears is significantly different from 0, i.e. the null hypothesis is \[\beta_{3} = \beta_{4} = \beta_{5} = 0.\]
linearHypothesis(fit,
                 rbind(c(0,0,0,1,0,0),
                       c(0,0,0,0,1,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0))Linear hypothesis test
Hypothesis:
length = 0
transportcar:length = 0
transportwalk:length = 0
Model 1: restricted model
Model 2: time ~ transport + length + length:transport
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    147 7452.2                                 
2    144 1415.0  3    6037.2 204.8 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Thus, it appears that the length of the route does influence the response.
Let’s look at the p-values for the individual tests!
summary(fit)
Call:
lm(formula = time ~ transport + length + length:transport, data = green)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.0179 -2.0579  0.0629  2.1336  6.3213 
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.5667208  1.3487377   3.386 0.000915 ***
transportcar          1.2107320  1.6665020   0.727 0.468705    
transportwalk        -1.1597079  1.9965712  -0.581 0.562250    
length                0.0052521  0.0013307   3.947 0.000123 ***
transportcar:length  -0.0002884  0.0013484  -0.214 0.830913    
transportwalk:length  0.0065389  0.0018779   3.482 0.000659 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared:  0.899, Adjusted R-squared:  0.8955 
F-statistic: 256.4 on 5 and 144 DF,  p-value: < 2.2e-16The summary is suggesting only one intercept and only two slopes (walk vs other transportation). Let’s test if we can block remove those three coefficients using as null hypothesis \[\beta_{1} = \beta_{2} = \beta_{5} = 0.\]
linearHypothesis(fit,
                 rbind(c(0,1,0,0,0,0),
                       c(0,0,1,0,0,0),
                       c(0,0,0,0,0,1)),
                 c(0,0,0))Linear hypothesis test
Hypothesis:
transportcar = 0
transportwalk = 0
transportwalk:length = 0
Model 1: restricted model
Model 2: time ~ transport + length + length:transport
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    147 2296.7                                 
2    144 1415.0  3     881.7 29.91 4.315e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1We can reduce the model to \[time = \beta_0 + \beta_1*length + \beta_2*length*walk + \varepsilon\]
fit2 <- lm(time ~ length + length:walk_idx, data = green)
summary(fit2)
Call:
lm(formula = time ~ length + length:walk_idx, data = green)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.1640 -1.9820  0.0539  2.0756  6.4508 
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.7579758  0.4581664   10.38   <2e-16 ***
length          0.0051614  0.0001438   35.90   <2e-16 ***
length:walk_idx 0.0054701  0.0004992   10.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.124 on 147 degrees of freedom
Multiple R-squared:  0.8976,    Adjusted R-squared:  0.8962 
F-statistic: 644.5 on 2 and 147 DF,  p-value: < 2.2e-16The three regression lines have the same intercept: 4.758.
The bike regression line and the car regression line have the same slope: 0.005.
The walk regression line has the slope: 0.0054701.
Car and bike have the same regression line.
ggplot() + 
  geom_point(aes(x = length, y = time, col = transport), data = green) + 
  geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2]), col = 'black') + 
  geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2:3]), col = cols[3]) + 
  labs(x = 'distance', y = 'time') + 
  scale_color_brewer(name = 'Transport', palette = 'Set1')The file smog.txt reports the concentrations of PM10 recorded by some weather stations in a town in Alaska between February 1 and February 10, 2010. The mayor of the town has forbidden the use of domestic heaters starting from February 7. City council requests some statistical analysis to study the efficacy of the ban on heaters.
estimate the parameters of the following models (\(t = 1\) denotes February 1):
smog <- read.table('./Data/smog.txt', header=T)
head(smog)  day station   PM10
1   1       1 41.605
2   1       2 41.622
3   1       3 43.427
4   1       4 42.276
5   1       5 42.583
6   2       1 44.219Let us try and write the possible models:
intervention <- ifelse(smog$day > 6.5, 1, 0)
par(mar=c(4,4,2,2), family = 'serif')
plot(smog$day, smog$PM10, pch = 16, col = alpha('black', 0.5), 
     xlab = 'day', ylab = 'PM10')
abline(v = 6.5, lty = 3)
fit1 <- lm(PM10 ~ day, smog)
fit2 <- lm(PM10 ~ day + I((day - 6.5) * intervention), smog)
fit3 <- lm(PM10 ~ day + intervention + I((day - 6.5) * intervention), smog)
summary(fit1)
Call:
lm(formula = PM10 ~ day, data = smog)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.4600 -0.8402 -0.1131  0.6881  3.5109 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.14209    0.38515  104.22   <2e-16 ***
day          2.07188    0.06207   33.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.261 on 48 degrees of freedom
Multiple R-squared:  0.9587,    Adjusted R-squared:  0.9578 
F-statistic:  1114 on 1 and 48 DF,  p-value: < 2.2e-16summary(fit2)
Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.2468 -1.0178  0.0062  0.7391  3.4297 
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    39.6276     0.4811  82.364   <2e-16 ***
day                             2.2314     0.1107  20.150   <2e-16 ***
I((day - 6.5) * intervention)  -0.4539     0.2632  -1.724   0.0912 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9595 
F-statistic: 581.5 on 2 and 47 DF,  p-value: < 2.2e-16summary(fit3)
Call:
lm(formula = PM10 ~ day + intervention + I((day - 6.5) * intervention), 
    data = smog)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.33462 -0.67789 -0.03462  0.67975  2.63922 
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    40.3889     0.4120  98.038  < 2e-16 ***
day                             1.9303     0.1058  18.247  < 2e-16 ***
intervention                    3.0407     0.5822   5.223 4.15e-06 ***
I((day - 6.5) * intervention)  -0.8555     0.2244  -3.812 0.000408 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9895 on 46 degrees of freedom
Multiple R-squared:  0.9756,    Adjusted R-squared:  0.974 
F-statistic: 613.4 on 3 and 46 DF,  p-value: < 2.2e-16day_grid <- seq(min(smog$day), max(smog$day), length.out = 200)
intervention_grid <- ifelse(day_grid > 6.5, 1, 0)
  
lines(day_grid, predict(fit1, newdata = list(day = day_grid)), col = cols[1], pch = 16, lwd = 2)
lines(day_grid, predict(fit2, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[2], pch = 16, lwd = 2)
lines(day_grid, predict(fit3, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[3], pch = 16, lwd = 2)
legend('topleft', legend = c(TeX('$C^1$'),TeX('$C^0$'),TeX('$C^{-1}$')), col = cols[1:3], lwd = 2)summary(fit2)
Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.2468 -1.0178  0.0062  0.7391  3.4297 
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    39.6276     0.4811  82.364   <2e-16 ***
day                             2.2314     0.1107  20.150   <2e-16 ***
I((day - 6.5) * intervention)  -0.4539     0.2632  -1.724   0.0912 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9595 
F-statistic: 581.5 on 2 and 47 DF,  p-value: < 2.2e-16Remember the model: this corresponds to the hypothesis \(\beta_1 + \beta_2 \text{ (slope after intervention)} = \frac{1}{2} \beta_1 \text{ (slope before intervention)}\).
This corresponds to \(H_0: \beta_1 + 2 \beta_2 = 0\).
linearHypothesis(fit2, 
                 rbind(c(0,1,2)), 
                 0)Linear hypothesis test
Hypothesis:
day  + 2 I((day - 6.5) * intervention) = 0
Model 1: restricted model
Model 2: PM10 ~ day + I((day - 6.5) * intervention)
  Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
1     48 85.678                               
2     47 71.749  1    13.929 9.124 0.004071 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1We can also calculate the confidence interval for a linear combination of the parameters:
library(multcomp)
hyp_matrix <- array(0, dim = c(2, length(coef(fit2)))) # 2 is the number of CI to compute, length(coef(fit2)) is the model dimension
hyp_matrix[1,] <- c(0,1,0)
hyp_matrix[2,] <- c(0,1,1)
confcm <- glht(fit2, linfct = hyp_matrix)
confcm <- confint(confcm)
plotCI(x = 1:2, y = confcm$confint[,1], ui = confcm$confint[,3], 
       li = confcm$confint[,2], pch = 21, lwd = 2, xaxt = 'n', xlab = '', 
       ylab = 'value', col = cols[1])
axis(1, at = 1:2, labels = c(TeX('$\\beta_1$'), TeX('$\\beta_1 + \\beta_2$')))